Pesquisa | Portal Regional da BVS

1.

Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments.

Zhang, Xu; Zhang, Xiangcheng; Chen, Weisi; Li, Chenlong; Yu, Chengyuan.

Sci Rep ; 14(1): 9543, 2024 04 25.

Artigo em Inglês | MEDLINE | ID: mdl-38664511

RESUMO

Depression, a pervasive global mental disorder, profoundly impacts daily lives. Despite numerous deep learning studies focused on depression detection through speech analysis, the shortage of annotated bulk samples hampers the development of effective models. In response to this challenge, our research introduces a transfer learning approach for detecting depression in speech, aiming to overcome constraints imposed by limited resources. In the context of feature representation, we obtain depression-related features by fine-tuning wav2vec 2.0. By integrating 1D-CNN and attention pooling structures, we generate advanced features at the segment level, thereby enhancing the model's capability to capture temporal relationships within audio frames. In the realm of prediction results, we integrate LSTM and self-attention mechanisms. This incorporation assigns greater weights to segments associated with depression, thereby augmenting the model's discernment of depression-related information. The experimental results indicate that our model has achieved impressive F1 scores, reaching 79% on the DAIC-WOZ dataset and 90.53% on the CMDC dataset. It outperforms recent baseline models in the field of speech-based depression detection. This provides a promising solution for effective depression detection in low-resource environments.

Assuntos

Aprendizado Profundo , Depressão , Fala , Humanos , Depressão/diagnóstico , Redes Neurais de Computação

2.

Exploring inter-trial coherence for inner speech classification in EEG-based brain-computer interface.

Lopez-Bernal, Diego; Balderas, David; Ponce, Pedro; Molina, Arturo.

J Neural Eng ; 21(2)2024 Apr 26.

Artigo em Inglês | MEDLINE | ID: mdl-38626760

RESUMO

Objective. In recent years, electroencephalogram (EEG)-based brain-computer interfaces (BCIs) applied to inner speech classification have gathered attention for their potential to provide a communication channel for individuals with speech disabilities. However, existing methodologies for this task fall short in achieving acceptable accuracy for real-life implementation. This paper concentrated on exploring the possibility of using inter-trial coherence (ITC) as a feature extraction technique to enhance inner speech classification accuracy in EEG-based BCIs.Approach. To address the objective, this work presents a novel methodology that employs ITC for feature extraction within a complex Morlet time-frequency representation. The study involves a dataset comprising EEG recordings of four different words for ten subjects, with three recording sessions per subject. The extracted features are then classified using k-nearest-neighbors (kNNs) and support vector machine (SVM).Main results. The average classification accuracy achieved using the proposed methodology is 56.08% for kNN and 59.55% for SVM. These results demonstrate comparable or superior performance in comparison to previous works. The exploration of inter-trial phase coherence as a feature extraction technique proves promising for enhancing accuracy in inner speech classification within EEG-based BCIs.Significance. This study contributes to the advancement of EEG-based BCIs for inner speech classification by introducing a feature extraction methodology using ITC. The obtained results, on par or superior to previous works, highlight the potential significance of this approach in improving the accuracy of BCI systems. The exploration of this technique lays the groundwork for further research toward inner speech decoding.

Assuntos

Interfaces Cérebro-Computador , Eletroencefalografia , Fala , Humanos , Eletroencefalografia/métodos , Eletroencefalografia/classificação , Masculino , Fala/fisiologia , Feminino , Adulto , Máquina de Vetores de Suporte , Adulto Jovem , Reprodutibilidade dos Testes , Algoritmos

3.

Conducting high-quality and reliable acoustic analysis: A tutorial focused on training research assistants.

Heller Murray, Elizabeth.

J Acoust Soc Am ; 155(4): 2603-2611, 2024 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-38629881

RESUMO

Open science practices have led to an increase in available speech datasets for researchers interested in acoustic analysis. Accurate evaluation of these databases frequently requires manual or semi-automated analysis. The time-intensive nature of these analyses makes them ideally suited for research assistants in laboratories focused on speech and voice production. However, the completion of high-quality, consistent, and reliable analyses requires clear rules and guidelines for all research assistants to follow. This tutorial will provide information on training and mentoring research assistants to complete these analyses, covering areas including RA training, ongoing data analysis monitoring, and documentation needed for reliable and re-creatable findings.

Assuntos

Distúrbios da Voz , Voz , Humanos , Acústica , Fala

4.

Lesions in a songbird vocal circuit increase variability in song syntax.

Koparkar, Avani; Warren, Timothy L; Charlesworth, Jonathan D; Shin, Sooyoon; Brainard, Michael S; Veit, Lena.

Elife ; 132024 Apr 18.

Artigo em Inglês | MEDLINE | ID: mdl-38635312

RESUMO

Complex skills like speech and dance are composed of ordered sequences of simpler elements, but the neuronal basis for the syntactic ordering of actions is poorly understood. Birdsong is a learned vocal behavior composed of syntactically ordered syllables, controlled in part by the songbird premotor nucleus HVC (proper name). Here, we test whether one of HVC's recurrent inputs, mMAN (medial magnocellular nucleus of the anterior nidopallium), contributes to sequencing in adult male Bengalese finches (Lonchura striata domestica). Bengalese finch song includes several patterns: (1) chunks, comprising stereotyped syllable sequences; (2) branch points, where a given syllable can be followed probabilistically by multiple syllables; and (3) repeat phrases, where individual syllables are repeated variable numbers of times. We found that following bilateral lesions of mMAN, acoustic structure of syllables remained largely intact, but sequencing became more variable, as evidenced by 'breaks' in previously stereotyped chunks, increased uncertainty at branch points, and increased variability in repeat numbers. Our results show that mMAN contributes to the variable sequencing of vocal elements in Bengalese finch song and demonstrate the influence of recurrent projections to HVC. Furthermore, they highlight the utility of species with complex syntax in investigating neuronal control of ordered sequences.

Assuntos

Aves Canoras , Masculino , Animais , Fala , Acústica , Memória , Comportamento Estereotipado

5.

SpEx: a German-language dataset of speech and executive function performance.

Camilleri, Julia A; Volkening, Julia; Heim, Stefan; Mochalski, Lisa N; Neufeld, Hannah; Schlothauer, Natalie; Kuhles, Gianna; Eickhoff, Simon B; Weis, Susanne.

Sci Rep ; 14(1): 9431, 2024 04 24.

Artigo em Inglês | MEDLINE | ID: mdl-38658576

RESUMO

This work presents data from 148 German native speakers (20-55 years of age), who completed several speaking tasks, ranging from formal tests such as word production tests to more ecologically valid spontaneous tasks that were designed to mimic natural speech. This speech data is supplemented by performance measures on several standardised, computer-based executive functioning (EF) tests covering domains of working-memory, cognitive flexibility, inhibition, and attention. The speech and EF data are further complemented by a rich collection of demographic data that documents education level, family status, and physical and psychological well-being. Additionally, the dataset includes information of the participants' hormone levels (cortisol, progesterone, oestradiol, and testosterone) at the time of testing. This dataset is thus a carefully curated, expansive collection of data that spans over different EF domains and includes both formal speaking tests as well as spontaneous speaking tasks, supplemented by valuable phenotypical information. This will thus provide the unique opportunity to perform a variety of analyses in the context of speech, EF, and inter-individual differences, and to our knowledge is the first of its kind in the German language. We refer to this dataset as SpEx since it combines speech and executive functioning data. Researchers interested in conducting exploratory or hypothesis-driven analyses in the field of individual differences in language and executive functioning, are encouraged to request access to this resource. Applicants will then be provided with an encrypted version of the data which can be downloaded.

Assuntos

Função Executiva , Fala , Humanos , Função Executiva/fisiologia , Adulto , Pessoa de Meia-Idade , Feminino , Masculino , Fala/fisiologia , Alemanha , Adulto Jovem , Idioma , Memória de Curto Prazo/fisiologia , Testes Neuropsicológicos

6.

The impact of face coverings on audio-visual contributions to communication with conversational speech.

Jackson, I R; Perugia, E; Stone, M A; Saunders, G H.

Cogn Res Princ Implic ; 9(1): 25, 2024 Apr 23.

Artigo em Inglês | MEDLINE | ID: mdl-38652383

RESUMO

The use of face coverings can make communication more difficult by removing access to visual cues as well as affecting the physical transmission of speech sounds. This study aimed to assess the independent and combined contributions of visual and auditory cues to impaired communication when using face coverings. In an online task, 150 participants rated videos of natural conversation along three dimensions: (1) how much they could follow, (2) how much effort was required, and (3) the clarity of the speech. Visual and audio variables were independently manipulated in each video, so that the same video could be presented with or without a superimposed surgical-style mask, accompanied by one of four audio conditions (either unfiltered audio, or audio-filtered to simulate the attenuation associated with a surgical mask, an FFP3 mask, or a visor). Hypotheses and analyses were pre-registered. Both the audio and visual variables had a statistically significant negative impact across all three dimensions. Whether or not talkers' faces were visible made the largest contribution to participants' ratings. The study identifies a degree of attenuation whose negative effects can be overcome by the restoration of visual cues. The significant effects observed in this nominally low-demand task (speech in quiet) highlight the importance of the visual and audio cues in everyday life and that their consideration should be included in future face mask designs.

Assuntos

Sinais (Psicologia) , Percepção da Fala , Humanos , Adulto , Feminino , Masculino , Adulto Jovem , Percepção da Fala/fisiologia , Percepção Visual/fisiologia , Máscaras , Adolescente , Fala/fisiologia , Comunicação , Pessoa de Meia-Idade , Reconhecimento Facial/fisiologia

7.

Linguistic based emotion analysis using softmax over time attention mechanism.

Roshan, Megha; Rawat, Mukul; Aryan, Karan; Lyakso, Elena; Mekala, A Mary; Ruban, Nersisson.

PLoS One ; 19(4): e0301336, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38625932

RESUMO

Recognizing the real emotion of humans is considered the most essential task for any customer feedback or medical applications. There are many methods available to recognize the type of emotion from speech signal by extracting frequency, pitch, and other dominant features. These features are used to train various models to auto-detect various human emotions. We cannot completely rely on the features of speech signals to detect the emotion, for instance, a customer is angry but still, he is speaking at a low voice (frequency components) which will eventually lead to wrong predictions. Even a video-based emotion detection system can be fooled by false facial expressions for various emotions. To rectify this issue, we need to make a parallel model that will train on textual data and make predictions based on the words present in the text. The model will then classify the type of emotions using more comprehensive information, thus making it a more robust model. To address this issue, we have tested four text-based classification models to classify the emotions of a customer. We examined the text-based models and compared their results which showed that the modified Encoder decoder model with attention mechanism trained on textual data achieved an accuracy of 93.5%. This research highlights the pressing need for more robust emotion recognition systems and underscores the potential of transfer models with attention mechanisms to significantly improve feedback management processes and the medical applications.

Assuntos

Emoções , Voz , Masculino , Humanos , Fala , Linguística , Reconhecimento Psicológico

8.

Disfluency in speech and language disorders.

Didirková, Ivana.

Clin Linguist Phon ; 38(4): 285-286, 2024 Apr 02.

Artigo em Inglês | MEDLINE | ID: mdl-38631031

Assuntos

Transtornos da Comunicação , Transtornos da Linguagem , Gagueira , Humanos , Fala , Distúrbios da Fala

9.

Treatment for Stuttering in Preschool-Age Children: A Qualitative Document Analysis of Treatment Programs.

Sjøstrand, Åse; Næss, Kari-Anne Bottegård; Melle, Ane Hestmann; Hoff, Karoline; Hansen, Elisabeth Holm; Guttormsen, Linn Stokke.

J Speech Lang Hear Res ; 67(4): 1020-1041, 2024 Apr 08.

Artigo em Inglês | MEDLINE | ID: mdl-38557114

RESUMO

PURPOSE: The purpose of this study was to identify commonalities and differences between content components in stuttering treatment programs for preschool-age children. METHOD: In this document analysis, a thematic analysis of the content was conducted of handbooks and manuals describing Early Childhood Stuttering Therapy, the Lidcombe Program, Mini-KIDS, Palin Parent-Child Interaction Therapy, RESTART Demands and Capacities Model Method, and the Westmead Program. First, a theoretical framework defining a content component in treatment was developed. Second, we coded and categorized the data following the procedure of reflexive thematic analysis. In addition, the first authors of the treatment documents have reviewed the findings in this study, and their feedback has been analyzed and taken into consideration. RESULTS: Sixty-one content components within the seven themes-interaction, coping, reactions, everyday life, information, language, and speech-were identified across the treatment programs. The content component SLP providing information about the child's stuttering was identified across all treatment programs. All programs are multithematic, and no treatment program has a single focus on speech, language, or parent-child interaction. A comparison of the programs with equal treatment goals highlighted more commonalities in content components across the programs. The differences between the treatment programs were evident in both the number of content components that varied from seven to 39 and the content included in each treatment program. CONCLUSIONS: Only one common content component was identified across programs, and the number and types of components vary widely. The role that the common content component plays in treatment effects is discussed, alongside implications for research and clinical practice. SUPPLEMENTAL MATERIAL: https://doi.org/10.23641/asha.25457929.

Assuntos

Gagueira , Humanos , Pré-Escolar , Gagueira/terapia , Fonoterapia/métodos , Análise Documental , Resultado do Tratamento , Fala

10.

Cortical specialization associated with native speech category acquisition in early infancy.

Ren, Jie; Cai, Lin; Jia, Gaoding; Niu, Haijing.

Cereb Cortex ; 34(4)2024 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-38566511

RESUMO

This study investigates neural processes in infant speech processing, with a focus on left frontal brain regions and hemispheric lateralization in Mandarin-speaking infants' acquisition of native tonal categories. We tested 2- to 6-month-old Mandarin learners to explore age-related improvements in tone discrimination, the role of inferior frontal regions in abstract speech category representation, and left hemisphere lateralization during tone processing. Using a block design, we presented four Mandarin tones via [ta] and measured oxygenated hemoglobin concentration with functional near-infrared spectroscopy. Results showed age-related improvements in tone discrimination, greater involvement of frontal regions in older infants indicating abstract tonal representation development and increased bilateral activation mirroring native adult Mandarin speakers. These findings contribute to our broader understanding of the relationship between native speech acquisition and infant brain development during the critical period of early language learning.

Assuntos

Percepção da Fala , Fala , Adulto , Lactente , Humanos , Idoso , Percepção da Fala/fisiologia , Percepção da Altura Sonora/fisiologia , Desenvolvimento da Linguagem , Encéfalo/diagnóstico por imagem , Encéfalo/fisiologia

11.

Sex differences in vocal behavior in virtual rooms compared to real rooms.

Papadimitriou, Georgios; Brunskog, Jonas; Heuchel, Franz M; Lyberg Åhlander, Viveka; Öhlund Wistbacka, Greta.

JASA Express Lett ; 4(4)2024 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-38568027

RESUMO

This study investigates speech production under various room acoustic conditions in virtual environments, by comparing vocal behavior and the subjective experience of speaking in four real rooms and their audio-visual virtual replicas. Sex differences were explored. Males and females (N = 13) adjusted their voice levels similarly to room acoustic changes in the real rooms, but only males did so in the virtual rooms. Females, however, rated the visual virtual environment as more realistic compared to males. This suggests a discrepancy between sexes regarding the experience of realism in a virtual environment and changes in objective behavioral measures such as voice level.

Assuntos

Caracteres Sexuais , Fala , Feminino , Masculino , Humanos , Acústica

12.

BioBBC: a multi-feature model that enhances the detection of biomedical entities.

Alamro, Hind; Gojobori, Takashi; Essack, Magbubah; Gao, Xin.

Sci Rep ; 14(1): 7697, 2024 04 02.

Artigo em Inglês | MEDLINE | ID: mdl-38565624

RESUMO

The rapid increase in biomedical publications necessitates efficient systems to automatically handle Biomedical Named Entity Recognition (BioNER) tasks in unstructured text. However, accurately detecting biomedical entities is quite challenging due to the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that utilizes multi-feature embeddings and is constructed based on the BERT-BiLSTM-CRF to address the BioNER task. BioBBC consists of three main layers; an embedding layer, a Long Short-Term Memory (Bi-LSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by learning the text through four types of embeddings: part-of-speech tags (POS tags) embedding, char-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is well-constructed and well-optimized for detecting different types of biomedical entities. Based on experimental results, our model outperformed state-of-the-art (SOTA) models with significant improvements based on six benchmark BioNER datasets.

Assuntos

Idioma , Semântica , Processamento de Linguagem Natural , Benchmarking , Fala

13.

English vowel recognition in multi-talker babbles mixed with different numbers of talkersa).

Wang, Xianhui; Xu, Li.

JASA Express Lett ; 4(4)2024 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-38573045

RESUMO

The present study examined English vowel recognition in multi-talker babbles (MTBs) in 20 normal-hearing, native-English-speaking adult listeners. Twelve vowels, embedded in the h-V-d structure, were presented in MTBs consisting of 1, 2, 4, 6, 8, 10, and 12 talkers (numbers of talkers [N]) and a speech-shaped noise at signal-to-noise ratios of -12, -6, and 0 dB. Results showed that vowel recognition performance was a non-monotonic function of N when signal-to-noise ratios were less favorable. The masking effects of MTBs on vowel recognition were most similar to consonant recognition but less so to word and sentence recognition reported in previous studies.

Assuntos

Idioma , Fala , Adulto , Humanos , Reconhecimento Psicológico , Razão Sinal-Ruído

14.

Superiorly Based Posterior Pharyngeal Flaps: Using A Care Pathway to Optimize Speech and Airway Outcomes.

Butterfield, James; Pencek, Megan; Sweitzer, Keith; Marrinan, Eileen; Connolly, Heidi; Neimanis, Sara; Morrison, Clinton.

Ann Plast Surg ; 92(4S Suppl 2): S101-S104, 2024 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-38556656

RESUMO

BACKGROUND: Pharyngeal flap (PF) surgery is effective at improving velopharyngeal sufficiency, but historical literature shows a concerning prevalence rate of obstructive sleep apnea (OSA), reported as high as 20%. Our institution has developed a protocol to minimize risk of postoperative obstructive complications and increase safety of PF surgery. We hypothesize that (1) preoperative staged removal of significant adenotonsillar tissue along with (2) multiview videofluoroscopy to guide patient-specific surgical approach via appropriately sized PFs can result in excellent speech outcomes while limiting occurrence of OSA. METHODS: This was a retrospective chart review of all patients with velopharyngeal insufficiency (VPI) (aged 2-20 years) seen at the University of Rochester from 2015 to 2022 undergoing PF surgery to correct VPI. Nasopharyngoscopy was used for surgical planning and airway evaluation. Patients with tonsillar and adenoid hypertrophy underwent staged adenotonsillectomy at least 2 months before PF. Multiview videofluoroscopy was used to identify anatomic causes of VPI and to determine PF width. Patients underwent polysomnography and speech evaluation before and at least 6 months after PF surgery. RESULTS: Forty-one children aged 8.5 ± 4.1 years (range, 4 to 18 years) who underwent posterior PF surgery for VPI were identified. This included 10 patients with 22q11.2 deletion and 4 patients with Pierre Robin sequence. Thirty-nine patients had both pre- and postoperative speech data and underwent both a pre- and postoperative sleep study. Polysomnography showed no significant difference in obstructive apnea-hypopnea index after posterior PF surgery (obstructive apnea-hypopnea index preop, 1.3 ± 1.2 events per hour; postop, 1.7 ± 2.1 events per hour; P = 0.111). Significant improvements in speech outcome were seen in patients who underwent PF (modified Pittsburgh score preop, 11.52 ± 1.37; postop, 1.09 ± 2.35; P < 0.05). CONCLUSIONS: Use of preoperative staged adenotonsillectomy as well as patient-specific PF dimensions results in effective resolution of VPI and a low risk of OSA.

Assuntos

Apneia Obstrutiva do Sono , Insuficiência Velofaríngea , Criança , Humanos , Fala , Estudos Retrospectivos , Procedimentos Clínicos , Faringe/cirurgia , Insuficiência Velofaríngea/cirurgia , Insuficiência Velofaríngea/complicações , Apneia Obstrutiva do Sono/etiologia , Complicações Pós-Operatórias/epidemiologia , Resultado do Tratamento

15.

The influence of listener experience, measurement scale and speech task on the reliability of auditory-perceptual evaluation of vocal quality.

Alves, Jônatas do Nascimento; Almeida, Anna Alice Figueiredo de; Yamasaki, Rosiane; Lopes, Leonardo Wanderley.

Codas ; 36(3): e20230175, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38629682

RESUMO

PURPOSE: To assess the influence of the listener experience, measurement scales and the type of speech task on the auditory-perceptual evaluation of the overall severity (OS) of voice deviation and the predominant type of voice (rough, breathy or strain). METHODS: 22 listeners, divided into four groups participated in the study: speech-language pathologist specialized in voice (SLP-V), SLP non specialized in voice (SLP-NV), graduate students with auditory-perceptual analysis training (GS-T), and graduate students without auditory-perceptual analysis training (GS-U). The subjects rated the OS of voice deviation and the predominant type of voice of 44 voices by visual analog scale (VAS) and the numerical scale (score "G" from GRBAS), corresponding to six speech tasks such as sustained vowel /a/ and /É/, sentences, number counting, running speech, and all five previous tasks together. RESULTS: Sentences obtained the best interrater reliability in each group, using both VAS and GRBAS. SLP-NV group demonstrated the best interrater reliability in OS judgment in different speech tasks using VAS or GRBAS. Sustained vowel (/a/ and /É/) and running speech obtained the best interrater reliability among the groups of listeners in judging the predominant vocal quality. GS-T group got the best result of interrater reliability in judging the predominant vocal quality. CONCLUSION: The time of experience in the auditory-perceptual judgment of the voice, the type of training to which they were submitted, and the type of speech task influence the reliability of the auditory-perceptual evaluation of vocal quality.

Assuntos

Disfonia , Percepção da Fala , Humanos , Fala , Reprodutibilidade dos Testes , Medida da Produção da Fala , Variações Dependentes do Observador , Qualidade da Voz , Acústica da Fala

16.

Attention Mobilization as a Modulator of Listening Effort: Evidence From Pupillometry.

Johns, M A; Calloway, R C; Karunathilake, I M D; Decruy, L P; Anderson, S; Simon, J Z; Kuchinsky, S E.

Trends Hear ; 28: 23312165241245240, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38613337

RESUMO

Listening to speech in noise can require substantial mental effort, even among younger normal-hearing adults. The task-evoked pupil response (TEPR) has been shown to track the increased effort exerted to recognize words or sentences in increasing noise. However, few studies have examined the trajectory of listening effort across longer, more natural, stretches of speech, or the extent to which expectations about upcoming listening difficulty modulate the TEPR. Seventeen younger normal-hearing adults listened to 60-s-long audiobook passages, repeated three times in a row, at two different signal-to-noise ratios (SNRs) while pupil size was recorded. There was a significant interaction between SNR, repetition, and baseline pupil size on sustained listening effort. At lower baseline pupil sizes, potentially reflecting lower attention mobilization, TEPRs were more sustained in the harder SNR condition, particularly when attention mobilization remained low by the third presentation. At intermediate baseline pupil sizes, differences between conditions were largely absent, suggesting these listeners had optimally mobilized their attention for both SNRs. Lastly, at higher baseline pupil sizes, potentially reflecting overmobilization of attention, the effect of SNR was initially reversed for the second and third presentations: participants initially appeared to disengage in the harder SNR condition, resulting in reduced TEPRs that recovered in the second half of the story. Together, these findings suggest that the unfolding of listening effort over time depends critically on the extent to which individuals have successfully mobilized their attention in anticipation of difficult listening conditions.

Assuntos

Esforço de Escuta , Pupila , Adulto , Humanos , Razão Sinal-Ruído , Fala

17.

Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation.

Naeini, Saeid Alavi; Simmatis, Leif; Jafari, Deniz; Yunusova, Yana; Taati, Babak.

IEEE J Transl Eng Health Med ; 12: 382-389, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38606392

RESUMO

Acoustic features extracted from speech can help with the diagnosis of neurological diseases and monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence to sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance significantly drops when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance in dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often used in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.

Assuntos

Percepção da Fala , Fala , Humanos , Interface para o Reconhecimento da Fala , Disartria/diagnóstico , Distúrbios da Fala

18.

Efficient Speech Detection in Environmental Audio Using Acoustic Recognition and Knowledge Distillation.

Priebe, Drew; Ghani, Burooj; Stowell, Dan.

Sensors (Basel) ; 24(7)2024 Mar 22.

Artigo em Inglês | MEDLINE | ID: mdl-38610256

RESUMO

The ongoing biodiversity crisis, driven by factors such as land-use change and global warming, emphasizes the need for effective ecological monitoring methods. Acoustic monitoring of biodiversity has emerged as an important monitoring tool. Detecting human voices in soundscape monitoring projects is useful both for analyzing human disturbance and for privacy filtering. Despite significant strides in deep learning in recent years, the deployment of large neural networks on compact devices poses challenges due to memory and latency constraints. Our approach focuses on leveraging knowledge distillation techniques to design efficient, lightweight student models for speech detection in bioacoustics. In particular, we employed the MobileNetV3-Small-Pi model to create compact yet effective student architectures to compare against the larger EcoVAD teacher model, a well-regarded voice detection architecture in eco-acoustic monitoring. The comparative analysis included examining various configurations of the MobileNetV3-Small-Pi-derived student models to identify optimal performance. Additionally, a thorough evaluation of different distillation techniques was conducted to ascertain the most effective method for model selection. Our findings revealed that the distilled models exhibited comparable performance to the EcoVAD teacher model, indicating a promising approach to overcoming computational barriers for real-time ecological monitoring.

Assuntos

Fala , Voz , Humanos , Acústica , Biodiversidade , Conhecimento

19.

Perception of vocoded speech in domestic dogs.

Mallikarjun, Amritha; Shroads, Emily; Newman, Rochelle S.

Anim Cogn ; 27(1): 34, 2024 Apr 16.

Artigo em Inglês | MEDLINE | ID: mdl-38625429

RESUMO

Humans have an impressive ability to comprehend signal-degraded speech; however, the extent to which comprehension of degraded speech relies on human-specific features of speech perception vs. more general cognitive processes is unknown. Since dogs live alongside humans and regularly hear speech, they can be used as a model to differentiate between these possibilities. One often-studied type of degraded speech is noise-vocoded speech (sometimes thought of as cochlear-implant-simulation speech). Noise-vocoded speech is made by dividing the speech signal into frequency bands (channels), identifying the amplitude envelope of each individual band, and then using these envelopes to modulate bands of noise centered over the same frequency regions - the result is a signal with preserved temporal cues, but vastly reduced frequency information. Here, we tested dogs' recognition of familiar words produced in 16-channel vocoded speech. In the first study, dogs heard their names and unfamiliar dogs' names (foils) in vocoded speech as well as natural speech. In the second study, dogs heard 16-channel vocoded speech only. Dogs listened longer to their vocoded name than vocoded foils in both experiments, showing that they can comprehend a 16-channel vocoded version of their name without prior exposure to vocoded speech, and without immediate exposure to the natural-speech version of their name. Dogs' name recognition in the second study was mediated by the number of phonemes in the dogs' name, suggesting that phonological context plays a role in degraded speech comprehension.

Assuntos

Percepção da Fala , Fala , Humanos , Animais , Cães , Sinais (Psicologia) , Audição , Linguística

20.

Associations between EEG power and coherence with cognition and early precursors of speech and language development across the first months of life.

Bradley, Holly; Smith, Beth A; Xiao, Ran.

PLoS One ; 19(4): e0300382, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38625991

RESUMO

The neural processes underpinning cognition and language development in infancy are of great interest. We investigated EEG power and coherence in infancy, as a reflection of underlying cortical function of single brain region and cross-region connectivity, and their relations to cognition and early precursors of speech and language development. EEG recordings were longitudinally collected from 21 infants with typical development between approximately 1 and 7 months. We investigated relative band power at 3-6Hz and 6-9Hz and EEG coherence of these frequency ranges at 25 electrode pairs that cover key brain regions. A correlation analysis was performed to assess the relationship between EEG measurements across frequency bands and brain regions and raw Bayley cognitive and language developmental scores. In the first months of life, relative band power is not correlated with cognitive and language scales. However, 3-6Hz coherence is negatively correlated with receptive language scores between frontoparietal regions, and 6-9Hz coherence is negatively correlated with expressive language scores between frontoparietal regions. The results from this preliminary study contribute to the existing literature on the relationship between electrophysiological development, cognition, and early speech precursors in this age group. Future work should create norm references of early development in these domains that can be compared with infants at risk for neurodevelopmental disabilities.

Assuntos

Eletroencefalografia , Fala , Lactente , Humanos , Eletroencefalografia/métodos , Desenvolvimento da Linguagem , Cognição/fisiologia , Encéfalo

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA